Data processor for variable width operands.

A pipelined or superscalar processor (10) that executes operations utilizing operand data of variable
bit widths improves parallel performance by partitioning a fixed bit width operand (200) into several
partial operand fields (215, 216 and 217), and checking for data dependencies, tagging and forwarding
data in these fields independently of one another. An instruction decoder (18) concurrently dispatches
multiple ROPs to various functional units (20, 21, 22 and 80). Conflicts which arise with respect to register
resources are resolved through register renaming. However, implementation of register renaming is
difficult when register structures are overlapping. The present invention supports independent
dependency checking, tagging and forwarding of partial bit fields of a register operand which, in combination,
allow renaming of registers. Therefore, the variable width register operand structure greatly assists the
processor to resolve data dependencies. Operands are tagged by a reorder buffer (26) and supplied with
data when it becomes available without regard for the type of data. This method of dependency
resolution supports parallel performance of operations and provides a substantial improvement in
overall speed of processing. Thus, the processor promotes parallel processing of operations that act
upon overlapping data structures which otherwise resist parallel handling.
The present invention relates to a processor architecture,
and more particularly to a processor architecture
for handling variable bit width data.

Processors generally process a single instruction
of an instruction set in several steps. Early technology
processors performed these steps serially. Advances
in technology have led to pipelined-architecture
processors, also called scalar processors, which
perform different steps of many instructions concurrently.
A "superscalar" processor is implemented using
a pipelined structure, but improves performance
by concurrently executing scalar instructions.

In a superscalar processor, instruction conflicts
and dependency conditions arise in which an issued
instruction cannot be executed because data or resources
are not available. For example, an issued instruction
cannot execute when its operands are dependent
upon data calculated by other nonexecuted
instructions.

Superscalar processor performance is improved
by the speculative execution of branching instructions
and by continuing to decode instructions regardless
of the ability to execute instructions immediately.
Decoupling of instruction decoding and instruction
execution requires a buffer between the processor's
instruction decoder and functional units that execute
instructions.

Performance of a superscalar processor is also
improved when multiple concurrently-executing instructions
are allowed to access a common register.
However, this inherently creates a resource conflict.
One technique for resolving register conflicts is called
"register renaming". Multiple temporary renaming
registers are dynamically allocated, one for each instruction
that sets a value for a permanent register.
In this manner, one permanent register may serve as
a destination for receiving the results of multiple instructions.
These results are temporarily held in the
multiple allocated temporary renaming registers. The
processor keeps track of the renaming registers so that
an instruction that receives data from a renaming register
accesses the appropriate register. This register
renaming function may be implemented using a reorder
buffer which contains temporary renaming registers.

Many existing processors run a large base of
computer programs but are limited in performance. To
improve instruction throughput in such processors, it
may be desirable to incorporate superscalar capabilities
therein. W.M. Johnson in Superscalar Processor
Design, Englewood Cliffs, N.J., Prentice Hall, 1991,
p. 261-272, discusses such a superscalar implementation.

For example, a family of processors, called the
x86 family, has been developed, including 8086,
80286, 80386, 80486 and Pentium™ (Intel Corporation,
Santa Clara, CA) processors. Advantageously,
x86 processors are backward compatible. The newest
processors run the same programs as older processors.
x86 processors are considered to employ a
complex-instruction-set-computer (CISC) architecture,
in which many different densely-coded instructions
are implemented.

A variety of techniques have been used in the x86
family to implement backward compatibility. These
techniques have made the implementation of register
renaming very difficult. For example, the x86 instruction
set uses registers for which at least a subset of
bits overlap the bits of another register, such as word
registers that overlap doubleword registers and byte
registers that overlap word and doubleword registers.
As x86 processors evolved from 8 to 16-bit and then
to 32-bit processors, the register architecture similarly
evolved into a form in which 8-bit general registers
AH and AL, respectively, comprise the high and low
bytes of a 16-bit general register AX. AX, in turn,
includes the low order 16 bits of a 32-bit extended general
register EAX. B, C and D registers are similarly
constrained. These registers are supplemented by
additional register pairs: ESI:SI, EDI:DI, ESP:SP and
EBP:BP, having low order bits of the 32-bit extended
(E) doubleword registers overlapped by 16-bit word
registers. In addition, x86 processors have an extensive
and complicated instruction set that introduces
additional complexity so that some instruction opcode
fields that specify overlapping registers for
some data widths also specify nonoverlapping registers
of other data widths.
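
The overlap among EAX, AX, AH and AL described above can be illustrated
with a minimal Python sketch. The helper names `read_view` and `write_view`
and the `VIEWS` table are hypothetical and serve only this illustration;
the bit positions follow the text (AL is the low byte of AX, AH is the high
byte of AX, and AX is the low order 16 bits of EAX).

```python
MASK32 = 0xFFFFFFFF

# (shift, width) of each view inside the 32-bit extended register.
# Hypothetical table for illustration; positions follow the text above.
VIEWS = {"EAX": (0, 32), "AX": (0, 16), "AH": (8, 8), "AL": (0, 8)}


def read_view(eax, view):
    """Read one overlapping view out of a 32-bit EAX value."""
    shift, width = VIEWS[view]
    return (eax >> shift) & ((1 << width) - 1)


def write_view(eax, view, value):
    """Return the new EAX value after writing `value` through a view.

    Bits outside the view are preserved, which is why a write to AH
    is visible to later reads of AX and EAX.
    """
    shift, width = VIEWS[view]
    mask = ((1 << width) - 1) << shift
    return ((eax & ~mask) | ((value << shift) & mask)) & MASK32
```

A write through the 8-bit view AH changes what a later read of AX or EAX
returns, which is the kind of dependency between differently sized
registers that complicates renaming.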

If registers cannot be renamed, register access 
conflicts are resolved only by having one instruction 
cede control to another, delaying the dispatch of an 
instruction until the instruction is free of dependen- 
cies and causing stalling of the parallel dispatching of 
instructions in the processor pipeline. This causes 
serial operation of instructions that are intended to be
executed in parallel. 

Because the x86 architecture includes a small 
number of registers (eight), frequent register reusage 
is encouraged for a superscalar processor that is intended
to execute instructions in parallel. It is thus desirable
to allow register reusage, perhaps by employing
register renaming. Unfortunately, the overlapping
nature of x86 instructions limits the renaming of overlapped
registers for resolving mutual data dependencies.
Register renaming is impeded because, although
the overlap relationship of registers is known
and invariable and thus predictable, architectural and 
code-compatibility constraints require that the regis- 
ters be considered independent entities. Thus, al- 
though register renaming could resolve register re- 
source conflicts in an x86 processor, the x86 architec- 
ture substantially limits register renaming. 

It is fundamental to achieving a performance improvement
using parallel processing that instructions
be dispatched regularly and rapidly. When dispatching
of instructions is stalled awaiting execution of another
instruction, the processor performs only as well
as a serial processor.

Accordingly, it is desirable in a superscalar or pipelined
processor to provide an improved processor
architecture that supports renaming of overlapping
register bit fields. Also, it is desirable in a superscalar
or pipelined processor to provide dependency checking
and forwarding of overlapping register bit fields.
We shall describe a processor that employs partial
register renaming to achieve dependency checking
and result forwarding of overlapping data fields.

Embodiments of the present invention solve
problems arising from parallel processing of overlapping
data structures by providing an apparatus and
method for performing operations upon variable size
operands which include characterizing the variable-sized
operands of each operation as several fields
within a full size operand. Each field is designated as
defined or undefined with respect to the instruction.
The apparatus and method also determine whether
the operation is data dependent upon the execution
of another operation for each defined field of the variable
size operand, independently of the other fields.
When data becomes available, variable size data are
forwarded for utilization in the operation for each defined
field independently of the other fields.

A further embodiment of the present invention
solves problems arising from parallel processing of
overlapping data structures by providing a processor
for performing operations utilizing operand data of a
variable size. The processor includes an instruction
decoder for decoding instructions that utilize variable
size operands which are partitioned from a full size
operand into several fields. The decoder designates
each field as defined or undefined with respect to the
instruction. The processor also includes a reorder
buffer for temporarily storing control information and
operand data relating to the operation and for determining
whether variable size operand data utilized by
the operation are available. Availability is determined
for each defined field independently of the other
fields. A plurality of functional units are supplied that
execute operations and generate variable-sized result
data for each defined field independently of the
other fields. Several busses are included for accessing
variable size operand data for utilization by a functional
unit which executes an operation when the data
upon which it is dependent becomes available. Availability
of data is tested for each defined field independently
of the other fields.

In the accompanying drawings, by way of example
only:

Figure 1 is an architecture-level block diagram of
a processor which implements dependency
checking and forwarding of variable width operands;
Figures 2, 3, 4, 5 and 6 are tables that depict multiple
bit fields within an operand utilized by operations
performed by the processor of Figure 1;
Figure 7 depicts control bits of a pointer which
selects a register of a register file;
Figure 8 is an architecture-level block diagram of
a register file within the processor of Figure 1;
Figure 9 is a pictorial illustration of a memory format
in the register file shown in Figure 8;
Figure 10 is an architecture-level block diagram
of a reorder buffer within the processor of Figure 1;
Figure 11 is a pictorial illustration of a memory
format within the reorder buffer of Figure 10;
Figure 12 is a table that depicts dispatching of an
exemplary sequence of instructions using a naive
implementation of register renaming;
Figure 13 is a table that depicts dispatching of an
exemplary sequence of instructions using a preferred
implementation of register renaming;
Figure 14 is an architectural-level block diagram
of a generic functional unit which illustrates input
and output handling of such a unit;
Figure 15 is a pictorial illustration of a FIFO format
within a reservation station of the functional
unit of Figure 14; and
Figure 16 is an architectural-level block diagram
of a load / store functional unit within the processor
of Figure 1.

The architecture of a superscalar processor 10 
having an instruction set for executing integer and 
floating point operations is shown in Figure 1. A 64- 
bit internal address and data bus 11 communicates 
address, data, and control transfers among various 
functional blocks of the processor 10 and an external 
memory 14. An instruction cache 16 parses and pre- 
decodes CISC instructions. A byte queue 35 transfers 
the predecoded instructions to an instruction decod- 
er 18, which maps the CISC instructions to respective 
sequences of instructions for RISC-like operations 
("ROPs"). The instruction decoder 18 generates type, 
opcode, and pointer values for all ROPs based on the 
pre-decoded CISC instructions in the byte queue 35. 

A suitable instruction cache 16 is described in
further detail in our copending patent application
94306870.0. A suitable byte queue 35 is described in
additional detail in our copending patent application
94306873.4. A suitable instruction decoder 18 is described
in further detail in our copending patent application
94306884.1. Each of these patent applications
is incorporated herein by reference in its entirety.

The instruction decoder 18 dispatches ROPs
to functional blocks within the processor 10
over various busses. The processor 10 supports four-ROP
issue, five ROP results, and the results of up to
sixteen speculatively executed ROPs. Up to four sets
of pointers to the A and B source operands and to a
destination register are furnished over respective
A-operand pointers 36, B-operand pointers 37 and destination
register pointers 43 by the instruction decoder
18 to a register file 24 and to a reorder buffer 26.
The register file 24 and reorder buffer 26 in turn furnish
the appropriate "predicted executed" versions of
the RISC operands A and B to various functional units
on four pairs of 41-bit A-operand busses 30 and B-operand
busses 31. Associated with the A and B-operand
busses 30 and 31 are operand tag busses, including
four pairs of A-operand tag busses 48 and B-operand
tag busses 49. When a result is unavailable for
placement on an operand bus, a tag that identifies an
entry in the reorder buffer 26 for receiving the result
when it becomes available is loaded onto a corresponding
operand tag bus. The four pairs of operand
and operand tag busses correspond to four ROP dispatch
positions. The instruction decoder 18, in cooperation
with the reorder buffer 26, specifies four destination
tag busses 40 for identifying an entry in the reorder
buffer 26 that will receive results from the functional
units after an ROP is executed. Functional
units execute an ROP, copy the destination tag onto
one of five result tag busses 39, and place a result on
a corresponding one of five result busses 32 when the
result is available. A functional unit directly accesses
a result on result busses 32 when a corresponding tag
on result tag busses 39 matches the operand tag of
an ROP awaiting the result.
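
The tag matching that lets a functional unit pick a result directly off
the result busses can be sketched as follows. This is a simplified
software model under stated assumptions, not the circuit of the
embodiment: the function names `capture_results` and `ready`, and the
representation of an operand as a dictionary with `tag` and `value`
keys, are hypothetical.

```python
def capture_results(operands, result_busses):
    """Fill in waiting operands whose tag appears on a result tag bus.

    operands:      list of {"tag": int or None, "value": int or None};
                   a non-None tag means the operand is still awaited.
    result_busses: list of (result_tag, result) pairs, one per result
                   bus carrying a result in this cycle (up to five).
    Returns a new operand list; the inputs are not modified.
    """
    updated = []
    for op in operands:
        tag, value = op["tag"], op["value"]
        if tag is not None:
            for result_tag, result in result_busses:
                if result_tag == tag:
                    # Tag match: take the forwarded result, clear the tag.
                    tag, value = None, result
                    break
        updated.append({"tag": tag, "value": value})
    return updated


def ready(operands):
    """An ROP can issue only when no operand is still waiting on a tag."""
    return all(op["tag"] is None for op in operands)
```

In this model an ROP waiting on tag 3 becomes ready in the cycle in
which some result bus carries tag 3, without waiting for the result to
be written back to the register file.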

The instruction decoder 18 dispatches opcode information,
including an opcode and an opcode type,
that accompanies the A and B source operand information
via four opcode / type busses 50.

Processor 10 includes several functional units,
such as a branch unit 20, an integer functional unit 21,
a floating point functional unit 22 and a load / store
functional unit 80. Integer functional unit 21 is presented
in a generic sense and represents units of various
types, such as arithmetic logic units, a shift unit
and a special registers unit. Branch unit 20 executes
a branch prediction operation, a technique which allows
an adequate instruction-fetch rate in the presence
of branches and is needed to achieve performance
with multiple instruction issue. A suitable
branch prediction system, including a branch unit 20
and instruction decoder 18, is described in further detail
in United States Patent No. 5,138,697 (William M.
Johnson, "System for Reducing Delay for Execution
Subsequent to Correctly Predicted Branch Instruction
Using Fetch Information Stored with each Block
of Instructions in Cache"), which is incorporated herein
by reference in its entirety.

Processor 10 is shown having a simple set of
functional units to avoid undue complexity. It will be
appreciated that the number and type of functional
units are depicted herein in a specified manner, with
a single floating point functional unit 22 and multiple
functional units 20, 21 and 22 which generally perform
operations on integer data, but other combinations
of integer and floating point units may be implemented,
as desired. Each functional unit 20, 21, 22
and 80 has respective reservation stations (not
shown) having inputs connected to the operand busses
30 and 31 and the opcode / type busses 50. Reservation
stations allow dispatch of speculative ROPs
to the functional units.

Register file 24 is a physical storage memory in- 
cluding mapped CISC registers for integer and float- 
ing point instructions. Register file 24 is addressed by 
up to two register pointers of the A and B-operand 
pointers 36 and 37 which designate a register number 
for source operands for each of up to four concurrent- 
ly dispatched ROPs. These pointers point to a register 
file entry and the values in the selected entries are 
placed onto operand busses of the operand busses 
30 and 31 through eight read ports. Integers are stor- 
ed in 32-bit <31:0> registers and floating point num- 
bers are stored in 82-bit <81:0> floating point regis- 
ters of the register file 24. Register file 24 receives in- 
teger and floating point results of executed and non- 
speculative operations from the reorder buffer 26 
over four 41-bit writeback busses 34. Results are
written from the reorder buffer 26 to the register file
24 as ROPs are retired.

Reorder buffer 26 is a circular FIFO for keeping
track of the relative order of speculatively executed
ROPs. The reorder buffer 26 storage locations are dynamically
allocated for sending retiring results to register
file 24 and for receiving results from the functional
units. When an instruction is decoded, its destination
operand is assigned to the next available reorder
buffer location, and its destination register number
is associated with this location, in effect renaming
the destination register to the reorder buffer location.
The register numbers of its source operands are used
to access reorder buffer 26 and register file 24 simultaneously.
If the reorder buffer 26 does not have an
entry whose destination pointer matches the source
operand register number, then the value in the register
file 24 is selected as the operand. If reorder buffer
26 does have one or more matching entries, the value
of the most recently allocated matching entry is furnished
if it is available. If the result is unavailable, a tag
identifying this reorder buffer entry is furnished on an
operand tag bus of A and B-operand tag busses 48
and 49. The result or tag is furnished to the functional
units over the operand busses 30, 31 or operand tag
busses 48, 49, respectively. When results are obtained
from completion of execution in the functional
units 20, 21, 22 and 80, the results and their respective
result tags are furnished to the reorder buffer 26,
as well as to the reservation stations of the functional
units, over five bus-wide result busses 32 and result
tag busses 39. Of the five result and result tag and
status busses, four are general purpose busses for
forwarding integer and floating point results to the reorder
buffer. Additional fifth result and result tag and
status busses are used to transfer information that is
not a forwarded result from some of the functional
units to the reorder buffer. For example, status information
arising from a store operation by the
load / store functional unit 80 or from a branch operation
by the branch unit 20 is placed on the additional
busses. The additional busses conserve result bus
bandwidth. Reorder buffer 26 handles exceptions
and mispredictions, and maintains the state of certain
registers, including a program counter and execution
flags. A suitable unit for a RISC core, including a reorder
buffer, is disclosed in our patent application
94306869.2, which is incorporated by reference in its
entirety.
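
The operand selection rule described above (register file value when
no reorder buffer entry matches, otherwise the most recently allocated
matching entry, otherwise a tag) can be sketched at whole-register
granularity as follows. The function name `read_operand` and the
dictionary layout of an entry are hypothetical, and the reorder buffer
is modelled as a plain list ordered from oldest to most recently
allocated entry rather than as a circular FIFO.

```python
def read_operand(reg, rob, register_file):
    """Select the source for one operand.

    reg:           source register number.
    rob:           list of entries, oldest first, each a dict with
                   "dest" (register number), "valid" (result present)
                   and "result".
    register_file: mapping from register number to committed value.
    Returns ("value", v) when data is available, else ("tag", index).
    """
    # Search from the most recently allocated entry backwards.
    for index in range(len(rob) - 1, -1, -1):
        entry = rob[index]
        if entry["dest"] == reg:
            if entry["valid"]:
                return ("value", entry["result"])
            # Result not yet produced: hand out a tag naming the entry.
            return ("tag", index)
    # No pending writer: the register file holds the current value.
    return ("value", register_file[reg])
```

This sketch ignores the partial field handling described later in the
text; it only shows how assigning a destination to a reorder buffer
location renames the register for later readers.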

The instruction decoder 18 dispatches ROPs "in-order"
to the functional units. The order is maintained
in the order of reorder buffer entries. The functional
units queue ROPs and issue an ROP when all previous ROPs
in the queue have completed execution, all source operands
are available either via the operand busses or
result busses, and a result bus is available to receive
a result. Thus, functional units complete ROPs "out-of-order".
The dispatch of operations does not depend
on the completion of the operations so that, unless
the processor is stalled by the unavailability of a
reservation station queue or an unallocated reorder
buffer entry, instruction decoder 18 continues to decode
instructions regardless of whether they can be
promptly completed.

It is preferable to define a data path having at
least 32 bits of width for handling integer data. The
data path includes registers in the register file 24 and
a result field in each entry of the reorder buffer 26, as
well as the operand, result and writeback busses. In
one embodiment, the processor has a 41-bit data
path to accommodate floating point operations. The
32-bit data path is mapped into bits <31:0> of a 41-bit
structure.

A suitable load/store functional unit is disclosed
in our European patent application 94306872.6,
which is incorporated herein by reference in its entirety.

Multiple data types are represented by the 32-bit
integer data structure depicted in Figures 2, 3, 4, 5
and 6. Data structure 200, shown in Figure 2, is partitioned
into three fields: a 16-bit high field 217, an
8-bit middle field 216 and an 8-bit low field 215. A doubleword
202 shown in Figure 3 represents a 32-bit integer
data element that uses all the bits of the low,
middle and high fields of the structure 200. A word
204 shown in Figure 4 represents a 16-bit integer element
that uses all bits of the low and middle fields.
Bytes 206 and 208, shown respectively in Figures 5
and 6, represent 8-bit integer elements that employ
either all bits of the low field to define a low byte, or
all bits of the middle field to define a high byte. Unused
bits of data elements that are smaller than 32
bits are generally set to zero by various functional
blocks within the processor 10.
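
The mapping from data type to the high, middle and low fields can be
written out as bit masks. The following is a minimal sketch; the names
`FIELD_MASKS`, `TYPE_FIELDS`, `type_mask` and `zero_unused` are
hypothetical, while the field widths and the fields used by each data
type are taken from the paragraph above.

```python
# Bit masks of the three fields of the 32-bit data structure 200:
# 16-bit high field <31:16>, 8-bit middle field <15:8>, 8-bit low field <7:0>.
FIELD_MASKS = {"H": 0xFFFF0000, "M": 0x0000FF00, "L": 0x000000FF}

# Fields used by each data type, as described in the text.
TYPE_FIELDS = {
    "doubleword": ("H", "M", "L"),
    "word": ("M", "L"),
    "high_byte": ("M",),
    "low_byte": ("L",),
}


def type_mask(data_type):
    """Return the 32-bit mask covering the fields a data type uses."""
    mask = 0
    for field in TYPE_FIELDS[data_type]:
        mask |= FIELD_MASKS[field]
    return mask


def zero_unused(value, data_type):
    """Clear the bits of `value` that the data type does not use."""
    return value & type_mask(data_type)
```

Note that in this layout a high byte occupies the middle field, bits
<15:8>, not the high field.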



Each A-operand pointer, B-operand pointer and
destination register pointer is encoded in nine bits, as
is shown by the pointer 210 of Figure 7. The high order
six bits <8:3> of the pointer 210 specify a register
address which selects a particular register within the
register file 24 that is operated upon by an ROP. The
low order three bits (H, M and L) are field select bits
which specify the fields of the register that are defined
to be utilized by the ROP. The L bit selects the
low bit field 215 of the data structure 200 of Figure
2. The M bit selects the middle bit field 216 and the
H bit selects the high bit field 217. In this manner, the
pointer 210 supports selection of a bit field, independently
of the selection of the other bit fields.
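
The nine-bit pointer layout can be sketched as a pair of pack and
unpack helpers. The function names are hypothetical; the layout (six
register address bits in <8:3>, then H, M and L in bits 2, 1 and 0)
follows the text, with the bit positions of H, M and L taken from the
later dependency-checking description.

```python
# Field select bits in pointer bits <2:0>.
H_BIT, M_BIT, L_BIT = 0b100, 0b010, 0b001


def encode_pointer(reg_addr, fields):
    """Pack a 6-bit register address and 3 field select bits into 9 bits."""
    if not 0 <= reg_addr < 64:
        raise ValueError("register address must fit in six bits")
    return (reg_addr << 3) | (fields & 0b111)


def decode_pointer(pointer):
    """Split a 9-bit pointer into (register address, field select bits)."""
    return (pointer >> 3) & 0x3F, pointer & 0b111
```

For example, a word operand in register address 5 would carry
`encode_pointer(5, M_BIT | L_BIT)`, while a doubleword operand in the
same register would carry all three field select bits.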

A detailed illustration of the register file 24 is
shown in Figure 8. The register file 24 includes a read
decoder 60, a register file array 62, a write decoder
64, a register file control 66 and a register file operand
bus driver 68. The read decoder 60 receives selected
bits <8:3> of the A and B-operand pointers 36
and 37 for addressing the register file array 62 via four
pairs of 64-bit A and B operand address signals RA0,
RA1, RA2, RA3, RB0, RB1, RB2 and RB3. The remainder
of the A and B-operand pointer bits <2:0> are
applied to the register file operand bus driver 68 to
drive appropriate fields of operand data.

The register file array 62 receives result data 
from the reorder buffer 26 via the four writeback buss- 
es 34. When a reorder buffer entry is retired in parallel 
with up to three other reorder buffer entries, result 
data for the entry is placed on one of the writeback 
busses 34 and the destination pointer for that entry 
is placed on a write pointer 33 that corresponds to the 
selected writeback bus. Data on writeback busses 34 
are sent to designated registers in the register file ar- 
ray 62 in accordance with address signals on write 
pointer busses 33, which are applied to the write decoder
64.

The register file control 66 receives override sig- 
nals on A operand override lines 57 and B operand 
override lines 58 from the reorder buffer 26, which are 
then conveyed from the register file control 66 to the 
register file operand bus driver 68. The register file ar- 
ray 62 includes multiple addressable storage regis- 
ters for storing results operated upon and generated 
by processor functional units. Figure 9 shows an exemplary
register file array 62 with forty registers, including
eight 32-bit integer registers (EAX, EBX,
ECX, EDX, ESP, EBP, ESI and EDI), eight 82-bit
floating point registers FP0 through FP7, sixteen 41-bit
temporary integer registers ETMP0 through
ETMP15 and eight 82-bit temporary floating point
registers FTMP0 through FTMP7 which, in this embodiment,
are mapped into the same physical register
locations as the temporary integer registers ETMP0
through ETMP15.

Referring to Figure 10, reorder buffer 26 includes
a reorder buffer (ROB) control and status block 70, a
ROB array 74, and a ROB operand bus driver 76.
ROB control and status block 70 is connected to the
A and B-operand pointers 36 and 37 and the destination
pointer (DEST REG) busses 43 to receive inputs
which identify source and destination operands for an
ROP. ROB array 74 is a memory array which is controlled
by ROB control and status block 70. ROB array
74 is connected to the result busses 32 to receive
results from the functional units. Control signals, including
a head, a tail, an A operand select, a B operand
select and a result select signal, are conveyed
from ROB control and status 70 to ROB array 74.
These control signals select ROB array elements that
are input with data from result busses 32 and output to
writeback busses 34, write pointers 33, A and B-operand
busses 30 and 31, and A and B-operand tag busses
48 and 49. Sixteen destination pointers, one for
each reorder buffer array element, are conveyed from
ROB array 74 to ROB control and status 70 for performing
dependency checking.

Figure 11 depicts an example of a reorder buffer
array 74 which includes sixteen entries, each of which
includes a result field, a destination pointer field and
other fields for storing control information. A 41-bit result
field is furnished to store results received from
the functional units. Two reorder buffer entries are
used to store a floating point result. Integer results are
stored in 32 of the 41 bits and the remaining nine bits
are used to hold status flags. The destination pointer
field (DEST_PTR<8:0>) of each ROB array 74 entry
designates a destination register in register file 24.

The operation of the reorder buffer 26 and register
file 24 in combination is described with reference
to Figures 8, 9, 10 and 11. As the instruction decoder
18 dispatches ROPs, it provides source operand pointers
to the register file 24 and the reorder buffer 26 to
select the contents of a register or a reorder buffer entry
for application to the operand busses as a source
operand using the four pairs of A and B-operand pointers
36 and 37. In a similar manner, the instruction decoder
18 provides a destination pointer to the reorder
buffer 26 to identify a particular destination register
of the thirty destination registers in the register file
24, using the four destination register (DEST_REG)
pointer busses 43. The destination register is selected
to receive the result of an executed ROP.

When an ROP is dispatched, an entry of the reorder
buffer 26 is allocated to it. The entry includes a
result field to receive result data from the functional
units. The result field of each entry is defined as a
doubleword field, a word field, a high byte field, or a
low byte field, and receives a corresponding field of
the result data when it becomes available upon execution
of an ROP. For a doubleword operand field, the
16-bit high field, the 8-bit middle field, and the 8-bit
low field are all defined, as indicated by set bits 217,
216 and 215 respectively. For a word operand field,
only the 8-bit middle field and the 8-bit low field are
defined, as indicated by set bits 216 and 215 respectively.
For a low byte operand field, only the 8-bit low
field is defined, as indicated by set bit 215. For a high
byte operand field, only the 8-bit middle field is defined,
as indicated by set bit 216. The destination
pointer DEST_PTR, which contains the register address
in DEST_PTR<8:3> and the defined field bits
217, 216 and 215 in DEST_PTR<2:0>, is received by
the reorder buffer control status 70 over destination
register (DEST_REG) busses 43, and written into the
destination pointer field DEST_PTR<8:0> of the allocated
entry of the reorder buffer array 74.

The pointer of the A or B-operand pointers 36 and 
37 addresses the ROB array 74, through the ROB 
control block 70, to designate the operand data to be 
applied to the ROB operand bus driver 76. ROB con- 
trol and status 70 receives the operand pointers via 
the A and B-operand pointers 36 and 37. 

The reorder buffer 26 accomplishes dependency
checking by simultaneously comparing each of the
pointers of A and B-operand pointers 36 and 37 to the
destination pointer fields of all sixteen elements of reorder
buffer array 74 to determine whether a match,
which identifies a data dependency, exists. Up to
eight operand pointers are simultaneously compared
to the destination pointers. For the high operand field,
bits <8:3,2> of the operand pointer are compared to
bits <8:3,2> of the sixteen destination pointer fields.
For the middle operand field, bits <8:3,1> of the operand
pointer are compared with bits <8:3,1> of the
sixteen destination pointer fields. For the low operand
field, bits <8:3,0> of the operand pointer are compared
with bits <8:3,0> of the sixteen destination
pointer fields. A match for a particular field occurs
when the operand pointer bits <8:3> match the destination
pointer bits <8:3> and the field select bit identifying
the particular field is asserted in both the operand
pointer bits <2:0> and the destination pointer
bits <2:0>. An operand pointer may match destination
pointers for multiple reorder buffer entries. When one
or more matches occur, a pointer to the matching reorder
buffer entry closest to the tail of the queue is
used to identify the appropriate operand data. This
pointer is called an operand tag. The reorder buffer 26
furnishes three operand tags for each operand, one
for each field of the high, medium and low operand
fields. The operand tags are applied as pointers to the
reorder buffer entries to drive result data onto an operand
bus. The high, medium and low field select bits
of the operand pointer drive the reorder buffer result
data respectively onto bits <31:16>, <15:8> and <7:0>
of the operand bus.
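
The per-field comparison described above can be sketched in software
as follows. This is an illustrative model under stated assumptions:
the function name `field_tags` is hypothetical, the destination
pointers are given as a list ordered from oldest to most recently
allocated entry (so the last match in the list stands in for the
matching entry closest to the tail), and a list index stands in for
the operand tag. The hardware performs these comparisons
simultaneously rather than in a loop.

```python
# (field name, field select bit in pointer bits <2:0>)
FIELD_BITS = (("H", 0b100), ("M", 0b010), ("L", 0b001))


def field_tags(operand_ptr, dest_ptrs):
    """Return one operand tag per field, or None where nothing matches.

    operand_ptr: 9-bit source operand pointer.
    dest_ptrs:   9-bit destination pointers of the reorder buffer
                 entries, oldest entry first.
    """
    reg = operand_ptr >> 3
    tags = {}
    for name, bit in FIELD_BITS:
        tags[name] = None
        if not operand_ptr & bit:
            continue  # field not used by this operand
        # Newest entry first: register address and field bit must match.
        for index in range(len(dest_ptrs) - 1, -1, -1):
            dest = dest_ptrs[index]
            if dest >> 3 == reg and dest & bit:
                tags[name] = index
                break
    return tags
```

With pending writes to a doubleword, then a low byte, then a high byte
of the same register, a later word read obtains its middle field tag
from the third entry and its low field tag from the second, which is
the independent tagging of partial fields that the text describes.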

The status and control field <23:0> of the reorder 
buffer array 74 includes a result valid bit which is as- 
serted when the reorder buffer 26 receives a result 
from a functional unit. The result valid bit applies to 
all of the high, medium and low fields of the result. 
Simultaneously for the three fields, the ROB control 
block 70 addresses the ROB array 74 using the 
field-specific operand tags as pointers. The ROB array 74 
furnishes appropriate bits of the result field of the 
entry to the ROB operand bus driver 76. A and B-operand 
pointer bits <2:0> are applied to the ROB operand 
bus driver 76 to define the H, M and L bit fields to be 
driven onto the operand busses. Fields that are not 
defined are driven as zeros onto an operand bus. 
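
As a rough illustration of the bus driver behaviour, the sketch below assembles an operand bus value from per-field sources and drives undefined fields as zeros. The function name drive_operand_bus and the 32-bit field layout (H = <31:16>, M = <15:8>, L = <7:0>) are assumptions of this sketch.

```python
# Masks for the high, middle and low fields, keyed by field select bit.
FIELD_MASKS = {0b100: 0xFFFF0000, 0b010: 0x0000FF00, 0b001: 0x000000FF}


def drive_operand_bus(sources: dict, field_select: int) -> int:
    """Combine the selected fields; fields that are not selected stay zero.

    `sources` maps a field select bit to the result value supplying that field,
    which may come from a different reorder buffer entry for each field.
    """
    bus = 0
    for bit, mask in FIELD_MASKS.items():
        if field_select & bit:
            bus |= sources[bit] & mask
    return bus
```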

As the instruction decoder 18 drives opcodes 
onto the opcode/type busses 50 and the reorder 
buffer 26 drives operands onto the operand busses 30 
and 31, the reorder buffer 26 also drives the operand 
tags for each of the high, medium and low operand 
fields onto the operand tag busses 48 and 49. The 
three operand tags each include an identifier of the 
reorder buffer entry from which a source operand is 
sought, regardless of whether the data is available. 
An operand tag valid bit is associated with each of the 
high, medium and low field operand tags. An operand 
tag valid bit is asserted to indicate that operand data 
is not available. The operand tag valid bit is obtained 
for a reorder buffer entry by inverting the result valid 
bit of the entry. Thus, there are three independent 
tags in the A and B-operand tag busses 48 and 49 
associated with each operand bus 30 and 31 that supply 
tagging information for the low, middle and high data 
fields. Each of the high, medium and low operand 
tags is independent of the other tags so that, for the 
different fields, data may be driven onto operand 
busses from different reorder buffer entries or from 
the same entry, or data may not be driven from an 
entry at all. 

In the event that a particular field of an operand 
is driven by the register file 24, the ROB 26 does not 
assert the corresponding operand tag valid bit, 
thereby indicating that operand data for the particular 
field is available. A suitable dependency checking 
circuit is described in detail in our European patent 
application based on US application 08/233,568 
(Ref. PCCS/TT0370/SMP) being filed on even date 
herewith, which is hereby incorporated by reference. 

The reorder buffer 26 sends override signals 
whenever it detects a dependency, whether the result 
is held in the result field of the reorder buffer entry or 
the result is unavailable awaiting execution of an 
ROP. In either case, the reorder buffer control 70 sets 
override signals for each dependent field of an operand 
to the register file 24 via an appropriate one of A 
override lines 57 or B override lines 58. The reorder 
buffer control 70 overrides the read operation of any 
dependent low, middle or high fields of a register file 
array 62 entry by setting bits of the override busses 
57 and 58 that are applied to the register file operand 
bus driver 68. The A override busses 57 and the B 
override busses 58 include three forwarded-operand 
bits for each of the four A and B-operand busses 30 
and 31. The reorder buffer 26 controls overriding of 
any data fields and the register file 24 responds by 
disabling the register file operand bus driver 68 as 
instructed. Thus, an attempt to place data from the 
register file 24 onto an operand bus is overridden, but 
only for the operand fields for which a dependency 
exists. 
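
The per-field override can be pictured with the short sketch below, in which the register file drives only those requested fields that the reorder buffer has not overridden. The function name register_file_drive, the three-bit encoding of the override signals and the 32-bit field layout are assumptions made for illustration.

```python
# Masks for the high, middle and low fields, keyed by field select bit.
FIELD_MASKS = {0b100: 0xFFFF0000, 0b010: 0x0000FF00, 0b001: 0x000000FF}


def register_file_drive(reg_value: int, field_select: int, override: int) -> int:
    """Value the register file places on the operand bus.

    A field is driven only if it is selected by pointer bits <2:0> and its
    override bit is clear; overridden fields are left for the reorder buffer.
    """
    mask = 0
    for bit, field_mask in FIELD_MASKS.items():
        if field_select & bit and not override & bit:
            mask |= field_mask
    return reg_value & mask
```

For example, with all three fields requested and only the middle field overridden, the register file contributes the high and low fields and leaves bits <15:8> undriven.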

The read decoder 60 receives the A and B-oper- 
and pointers 36 and 37 and decodes operand pointer 
36 and 37 to select registers in the register file 24. The 
read decoder 60 decodes the high six bits of the op- 
erand pointer 36 and 37 element to select a register. 
The value from the accessed register is latched and 
driven onto one of the four pairs of 41-bit A or B-operand 
transfer lines connecting the register file array 
62 to the register file operand bus driver 68. Bit pos- 
itions that are not implemented in the integer registers 
of the register file array 62 are read as logical zeros 
on these busses. The register file operand bus driver 
68 drives the latched values selected in accordance 
with the H, M and L bit fields defined by bits <2:0> of 
operand pointers 36 and 37 onto A and B-operand 
busses 30 and 31. The register file control 66 receives 
the A and B override signals 57 and 58 from 
the reorder buffer 26 to direct the override of a register 
file read operation in any of the low, middle or high 
fields of the entry. 

If the reorder buffer 26 determines that source 
operand data are not dependent on unavailable data 
and are therefore available either from the register 
file 24 or the reorder buffer 26, the operand data are 
sent via operand busses 30 and 31 to the functional 
unit reservation stations. 

As functional units complete execution of opera- 
tions and place results on the result busses 32, ROB 
control and status 70 receives pointers from the result 
tag busses 39 which designate the corresponding 
ROB array entries to receive data from the result 
busses 32. ROB control 70 directs the transfer of data 
from the result busses 32 to the ROB array 74 using 
four result select pointers. 

ROB control and status 70 retires an ROP, communicating 
the result to the register file 24, by placing 
the result field of an ROB array 74 element on one of 
the writeback busses 34 and driving the write pointer 
33 corresponding to the writeback bus with the des- 
tination pointer. Write pointer 33 designates the reg- 
ister address within register file 24 to receive the re- 
tired result. 

Referring to Figure 12, without register renaming, 
resource conflicts arise in which a subsequent in- 
struction must wait for the completed execution of a 
previous instruction to resolve the conflict. This phe- 
nomenon is illustrated by the following sequence of 
x86 instructions: 

    mov ah,byte1
    mov al,byte2
    mov word12,ax
This code sequence may be used in a loop to interleave 
two byte strings or to swap the byte order of 16-bit 
data. The first instruction loads byte1 into register 
AH of register EAX. The second instruction loads 
byte2 into register AL of register EAX. The third 
instruction stores the AX register contents into a 16-bit 
memory location, named word12. 

In one implementation of register renaming, a 
modification to any part of the register EAX creates 
a new instance of the full register. This is appropriate 
for handling independent operations on the full register 
in parallel. However, to modify only a part of 
register EAX, such as in the first and second instructions 
above, and still be able to forward the full register 
contents to subsequent operations (the third instruction), 
the current contents of register EAX must be 
supplied to the functional unit for merger with the new 
field value to create the new 32-bit contents of register 
EAX. This generates a dependency of the second 
instruction upon the first instruction, shown by arrow 
A of Figure 12, so that the instructions execute in a 
serial manner. The third instruction is dependent 
upon the second, shown by arrow B, so that none of 
the three instructions can be executed in parallel. 

Furthermore, the destination register becomes a 
required input operand. Many x86 instructions have a 
two-operand form in which the destination is also one 
of the inputs. However, several instructions are defined 
in which this is not the case and the destination 
becomes a third input. Since the dependency handling 
logic must handle any of these cases, a greater 
burden is placed on the logic that is used only for 
these few exceptional instructions. 

In a preferred implementation of register renaming, 
shown in Figure 13, partial fields of a register 
(EAX) are renamed so that the second instruction 
does not depend on the first. Only the third instruction 
is dependent upon a previous instruction so that the 
first two instructions may execute in parallel and 
execution of only the third instruction is delayed due to a 
dependency condition. This preferred implementation 
of register renaming reduces the total time for the 
sequence from three cycles to two. Additional acceleration 
of the processor is accomplished since the 
data that results from the execution of the first two 
instructions is placed on the result busses and forwarded 
for execution of the third instruction. 
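
The difference between the two renaming schemes can be checked with a small scheduling sketch. It is only a model under stated assumptions: every operation takes one cycle, enough functional units are available, memory operands are ignored, and the names issue_cycles and SEQUENCE are introduced here for illustration.

```python
FIELDS = ("H", "M", "L")


def issue_cycles(instructions, whole_register_renaming):
    """Earliest execution cycle of each instruction in program order.

    Each instruction is a (fields_read, fields_written) pair for register EAX.
    """
    ready = {f: 0 for f in FIELDS}  # cycle producing the newest value of each field
    cycles = []
    for reads, writes in instructions:
        sources = set(reads)
        partial_write = bool(writes) and set(writes) != set(FIELDS)
        if whole_register_renaming and partial_write:
            sources |= set(FIELDS)  # old register contents are needed for the merge
        cycle = 1 + max((ready[f] for f in sources), default=0)
        written = FIELDS if (whole_register_renaming and writes) else writes
        for f in written:
            ready[f] = cycle
        cycles.append(cycle)
    return cycles


SEQUENCE = [
    ((), ("M",)),      # mov ah,byte1 writes the middle field
    ((), ("L",)),      # mov al,byte2 writes the low field
    (("M", "L"), ()),  # mov word12,ax reads the middle and low fields
]
```

Under these assumptions, issue_cycles(SEQUENCE, True) gives cycles 1, 2 and 3, while issue_cycles(SEQUENCE, False) gives cycles 1, 1 and 2, matching the reduction from three cycles to two described above.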

Figure 14 illustrates a generic functional unit 22 
that incorporates a generally standard set of component 
blocks and supports operand data having selectable 
variable bit widths. Functional units may differ 
from the generic embodiment with respect to various 
details. For example, a functional unit may have several 
reservation stations and access more than one 
set of operand busses and result busses at one time. 
The generic functional unit 22 includes an A multiplexer 
41, a B multiplexer 42 and a tag-opcode multiplexer 
45 for receiving input data and control signals. The 
generic functional unit 22 also includes a reservation 
station 44, an execution unit 95 and a result bus driver 
93. The reservation station 44 includes a FIFO 51, a 
tag-opcode FIFO 89, an A operand forwarding circuit 
90 and a B operand forwarding circuit 92. 

The A multiplexer 41 is a 4:1 multiplexer that is 
connected to the four 41-bit A-operand busses 30 and 
the four A-operand tag busses 48 to receive respec- 
tive input operands and operand tags. Similarly, the 
B multiplexer 42 is a 4:1 multiplexer that is connected 
to the four 41-bit B-operand busses 31 and the four 
B-operand tag busses 49. The tag-opcode multiplex- 
er block 45 includes type comparison logic (not 
shown) and two multiplexers, a tag multiplexer (not 
shown) that is connected to the four destination tag 
busses 40 and an opcode multiplexer (not shown) 
that is connected to the four opcode / type busses 50. 
The tag multiplexer and the opcode multiplexer are 
4:1 multiplexers. The bus select signal connects type 
comparison logic, the tag multiplexer and the opcode 
multiplexer internal to the tag-opcode multiplexer 
block 45 and is connected to the A multiplexer 41 and 
the B multiplexer 42. 

Within the reservation station 44, the tag-opcode 
FIFO 89 is connected to the tag-opcode multiplexer 
45 by destination tag lines and opcode lines that cor- 
respond respectively to a selected bus of the destin- 
ation tag busses 40 and to a selected bus of the op- 
code / type busses 50. The FIFO 51 is connected to 
the A multiplexer 41 by a first set of lines that corre- 
spond to a selected bus of the A-operand busses 30 
and by a second set of lines that correspond to a se- 
lected bus of the A-operand tag busses 48. The first 
set of lines are connected internal to the FIFO 51 to 
an A data FIFO 52. Internal to the FIFO 51, the A data 
FIFO 52 has lines which connect to the A operand 
forwarding circuit 90. The second set of lines are 
connected internal to the FIFO 51 to an A tag FIFO 53. 
The FIFO 51 is connected to the B multiplexer 42 by 
a third set of lines that correspond to a selected bus 
of the B-operand busses 31 and by a fourth set of 
lines that correspond to a selected bus of the B-operand 
tag busses 49. The third set of lines are connected 
internal to the FIFO 51 to a B data FIFO 55. Internal 
to the FIFO 51, the B data FIFO 55 has lines which 
connect to the B operand forwarding circuit 92. The 
fourth set of lines are connected internal to the FIFO 
51 to a B tag FIFO 56. The A operand forwarding circuit 
90 is connected to the A tag FIFO 53, the five result 
tag busses 39 and the five result busses 32. Similarly, 
the B operand forwarding circuit 92 is connected 
to the B tag FIFO 56, the five result tag busses 39 
and the five result busses 32. 

The execution unit 95 is connected to the reser- 
vation station 44 using A operand lines from the A 
data FIFO 52, B operand lines from the B data FIFO 
55, and destination tag lines and opcode lines from 
the tag-opcode FIFO block 89. The execution unit 95 
is also connected to a result grant signal input from a 
result bus arbitrator (not shown). The result bus driver 
93 is connected at its outputs to a result request signal 
line which connects to the result bus arbitrator, the 
result tag busses 39 and the result busses 32. 

The functional unit is activated when a type code 
match occurs between a dispatched ROP and a functional 
unit. A type code match takes place when a type 
code on one of the four type code busses corresponds 
to the type code assigned to the functional 
unit. When a type code matches, the tag-opcode 
multiplexer 45 generates a bus select signal that specifies 
the particular bus of the operand, operand tag, 
destination tag and opcode/type busses to be selected. 
The bus select signal is applied to the tag multiplexer 
and the opcode multiplexer of the tag-opcode 
multiplexer 45, the A multiplexer 41 and the B multiplexer 
42, directing operand and tag information into 
the reservation station 44. The selected destination 
tag and the selected opcode are written into the 
tag-opcode FIFO 89. The tag FIFO and the opcode FIFO 
of the tag-opcode FIFO 89 are temporary memories 
for holding the destination tag as well as the local 
opcode. The tag identifies the entry within the reorder 
buffer 26 into which the ROP and its result are eventually 
written after the instruction is executed and its 
result is placed on the result busses 32. Thus, the A 
and B operands, A and B operand tags, the destination 
tag and the opcode are held in the reservation 
station 44 and pushed through the FIFO 51 and 
tag-opcode FIFO 89 for each reservation station entry. 

The selected A operand data and A operand tag 
are respectively written into the A data FIFO 52 and 
the A tag FIFO 53. The selected B operand data and 
B operand tag are respectively written into the B data 
FIFO 55 and the B tag FIFO 56. Each of the A tag 53 
and B tag 56 entries in the FIFO 51 includes three 
operand tags corresponding to the high, medium and 
low fields of an operand. As shown generally in Figure 
15, each of the high, medium and low operand fields 
100, 101 and 102 has an associated operand tag valid 
bit 106, 107 and 108. Operand tag valid bits from 
the operand tag bus are directed through a multiplexer 
to a tag FIFO in the FIFO 51. A tag FIFO entry includes 
the high, medium and low operand tags 103, 
104 and 105, which are written into the tag FIFO together. 
The tag FIFO also includes the operand tag 
valid bits 106, 107 and 108 which indicate for each 
field, when set, that operand data is not available. 
Thus, if a field is defined and data is not available in 
the register file 24 or reorder buffer 26, the reorder 
buffer 26 drives onto the operand tag bus an asserted 
operand tag valid bit, accompanied by the operand 
tag which identifies the reorder buffer entry into 
which the unavailable data is written when it becomes 
available. If a field is undefined with respect to the 
ROP or data is available, the operand tag valid bit for 
that field is nonasserted. 

The purpose of a reservation station is to allow an 
instruction decoder to dispatch speculative ROPs to 
functional units regardless of whether source operands 
are currently available. This allows a number of 
speculative ROPs to be dispatched without waiting 
for a calculation or a load / store to complete. The A 
data 52, A tag 53, B data 55, B tag 56, and destination 
tag and opcode FIFOs of the tag-opcode FIFO block 
89 are two-deep FIFOs so that the reservation station 
44 can hold two source operands and tags plus the 
information on the destination and opcode in each of 
the entries. 

The reservation station 44 also forwards source 
operands that were unavailable at dispatch directly 
from the result busses 32 using the operand tags and 
the operand tag valid bits stored therein. When all ap- 
propriate A and B operand data fields are present in 
FIFO 51, the functional unit arbitrates for a bus of the 
result busses 32 using a result request signal from the 
result bus driver 93. 

When a result bus is available and the result grant 
signal is asserted, the execution unit 95 executes the 
ROP and conveys result data to the result bus driver 
93. Depending on the type of functional unit, the execution 
unit 95 executes one or more of a variety of 
operations that are standard in processors, including 
integer or floating point arithmetic, shifting, 
data load/store operations, data comparison and 
branching operations, and logic operations, for example. 

The execution unit 95 also arranges the data for 
output to the result busses 32. For single-byte oper- 
and opcodes, a one-bit AHBYTE (high byte) signal de- 
termines whether an 8-bit register operand is a high 
byte or a low byte, residing in the middle M or low L 
register fields, respectively, as is shown by the data 
structure 200 of Figure 2. If the AHBYTE signal is set, 
execution unit 95 locally remaps data in the middle 
field (bits <15:8>) into the low bit field position (bits 
<7:0>) before executing an ROP. The remapping op- 
eration includes the operations of sign extending or 
zero extending the remapped high bytes, in accor- 
dance with the specified operation. High bytes are al- 
ways read from the middle field from the register file 
24, the reorder buffer 26, the operand busses 30 and 
31 and the result busses 32. The high byte is remap- 
ped locally by functional units that perform right- 
justified operations. 
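
The local remapping of a high byte can be sketched as follows. The function name remap_high_byte and the choice of a 32-bit width for the sign or zero extension are assumptions of this illustration.

```python
def remap_high_byte(operand: int, ahbyte: bool, sign_extend: bool) -> int:
    """Move bits <15:8> into bits <7:0> when the AHBYTE signal is set.

    The remapped byte is zero extended, or sign extended to 32 bits when
    `sign_extend` is requested; other operands pass through unchanged.
    """
    if not ahbyte:
        return operand
    byte = (operand >> 8) & 0xFF
    if sign_extend and byte & 0x80:
        return byte | 0xFFFFFF00
    return byte
```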

The result bus driver 93 drives the result data 
onto the available 41-bit result bus and the corresponding 
entries in the reservation station data, operand 
tag, destination tag and opcode FIFOs are 
cleared. The result bus driver 93 drives the destina- 
tion tag from the destination tag FIFO onto the result 
tag bus that is associated with the available result 
bus. In addition to the destination tag, the result bus 
driver 93 sets status indication signals on the result 
bus including normal, valid and exception flags, for 
example. 

The A operand forwarding circuit 90 and the B operand 
forwarding circuit 92 monitor the result tag 
busses 39 to detect a result that satisfies a data 
dependency that is delaying execution of an ROP within 
the FIFO 51. A result tag identifies the reorder buffer 
entry into which the result is written. The A operand 
forwarding circuit 90 monitors the result tag busses 
39 and compares the result tags carried thereon to a 
tag from the A tag FIFO 53. In this monitoring operation, 
the forwarding circuit compares each of the 
high, medium and low-order operand tags to the result 
tag ROB entry identifier. Although all three fields are 
tested simultaneously, each of the three fields is tested 
independently of the other fields. 

For each field, when the field-specific operand 
tag matches the result tag, a data dependency is resolved 
for that field. The result data for the field is forwarded 
into the corresponding data FIFO entry and 
written into the bits of the field. The operand tag valid 
bit for the corresponding tag FIFO entry and field is 
cleared to indicate that the data dependency relating 
to that field is resolved. 

When data dependencies are resolved for all 
fields of all source operands of an ROP, and the func- 
tional unit is not busy and a result bus is available, the 
ROP is executed. In a similar manner, the B operand 
forwarding circuit 92 monitors the result tag busses 
39 and compares the result tags carried thereon to a 
tag from the B tag FIFO 56. The result bus driver 93 
always drives a result onto a result bus with each field 
appropriately positioned so that data is consistently 
presented in the correct position when data is for- 
warded to the reservation station 44. 
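
The per-field forwarding performed by an operand forwarding circuit can be sketched as below. The class name OperandEntry, its snoop and ready methods, and the 32-bit field layout are hypothetical names and assumptions chosen for this illustration.

```python
from dataclasses import dataclass, field

FIELD_MASKS = {"H": 0xFFFF0000, "M": 0x0000FF00, "L": 0x000000FF}


@dataclass
class OperandEntry:
    data: int = 0
    tags: dict = field(default_factory=dict)       # field name -> ROB entry identifier
    tag_valid: dict = field(default_factory=dict)  # field name -> True while data is missing

    def snoop(self, result_tag: int, result_data: int) -> None:
        """Compare one result tag against each field's operand tag independently."""
        for name, mask in FIELD_MASKS.items():
            if self.tag_valid.get(name) and self.tags.get(name) == result_tag:
                # Write only the bits of the matching field, then clear its valid bit.
                self.data = (self.data & ~mask) | (result_data & mask)
                self.tag_valid[name] = False

    def ready(self) -> bool:
        """True when no field of this operand is still waiting for data."""
        return not any(self.tag_valid.values())
```

An operand waiting on two different reorder buffer entries for its middle and low fields becomes ready only after both results have appeared on the result busses, in either order.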

Referring to Figure 16, load/store functional unit 
80 executes LOAD and STORE instructions and interacts 
with the data cache 86. Load/store functional 
unit 80 includes a dual-ported reservation station 85, 
a four-entry store buffer 84 and a load/store result 
bus driver 87. Each port is connected to the store buffer 
84 and the data cache 86 by a channel, which includes 
40 data bits and a suitable number of address 
bits. The reservation station 85 includes a multiplexer 
81, a load store controller 83, a merge circuit 91 and 
a FIFO 82 for queuing up to four ROPs. 

The multiplexer 81 includes 4:1 multiplexers that 
are connected to the A and B-operand and tag busses 
30, 31, 48 and 49. Each FIFO entry in the reservation 
station 85 holds all of the information fields that are 
necessary to execute a load or store operation. In one 
processor clock cycle, up to two ROPs are issued and 
up to two FIFO entries are retired. The load/store 
reservation station 85 is connected, at its inputs, to the 
four A and B operand busses 30 and 31, the four A and 
B operand tag busses 48 and 49, the five result busses 
32, the four destination tag busses 40 and the four 
opcode/type busses 50. The reservation station 85 is 
also connected to the data portions of ports A and B 
of data cache 86. Reservation station 85 is connected 
to store buffer 84 using A and B port reservation station 
data busses RSDATA A and RSDATA B, respectively, 
and A and B port reservation station address 
busses RSADDR A and RSADDR B, respectively, 
which are also connected to the address lines of ports 
A and B of the data cache 86. Reservation station 85 
is connected to controller 83 using a reservation sta- 
tion load bus RSLOAD and a reservation station shift 
bus RSHIFT. The store buffer 84 is connected to the 
load/store result bus driver 87, the address/data bus 
11, and the load store controller 83 using a store buf- 
fer load bus SBLOAD and a store buffer shift bus 
SBSHIFT. In addition to connections with reservation 
station 85 and store buffer 84, load store controller 83 
is connected to data cache 86 and reorder buffer 26. 
In addition to connections to store buffer 84, the 
load/store result bus driver connects to the data 
cache 86 and to the five result busses 32 and the five 
result tag busses 39. 

Data cache 86 is a linearly addressed 4-way in- 
terleaved, 8 Kbyte 4-way set associative cache that 
supports two operations per clock cycle. Data cache 
86 is arranged as 128 sixteen byte entries. Each 16 
byte entry is stored in a line of four individually ad- 
dressed 32-bit banks. Individually addressable banks 
permit the data cache 86 to be accessed concurrently 
by two ROPs, such as two simultaneous load operations, 
while avoiding the overhead identified with 
dual porting. 
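
A simple way to picture the banked arrangement is sketched below. The text does not state which address bits select a bank or how same-bank accesses are handled, so the use of address bits <3:2> and the rule that two accesses proceed together only when their banks differ are assumptions of this sketch, as are the function names.

```python
def bank_of(address: int) -> int:
    """Bank index of a 32-bit word within a 16-byte line (assumed: address bits <3:2>)."""
    return (address >> 2) & 0x3


def can_issue_together(address_a: int, address_b: int) -> bool:
    """Two accesses are assumed to proceed in the same cycle if they use different banks."""
    return bank_of(address_a) != bank_of(address_b)
```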

A load operation reads data from the data cache 
86. During a load operation, reservation station 85 
supplies an address to data cache 86. If the address 
generates a cache hit, data cache 86 furnishes the 
data which is stored in a corresponding bank and 
block of a store array (not shown) of the data cache 
86 to reservation station 85. A doubleword is transferred 
from the data cache 86 to the load/store result bus 
driver 87. The upper two bits of the load/store instruc- 
tion opcode specify the size of the result to be pro- 
duced. The types of results are doublewords, words, 
high bytes or low bytes. Unused bits are set to zero. 
For high bytes, the result produced by executing the 
ROP is remapped into the middle bit field before the 
result is driven onto the result busses 32 by the 
load/store result bus driver 87. High bytes are always 
read from the middle bit field of the operand. 
Load/store result bus driver 87 masks unused por- 
tions of data that are read by the doubleword read op- 
eration. If the AHBYTE signal is set, the load/store re- 
sult bus driver 87 remaps the low field data bits <7:0> 
into the middle field bits <15:8>. The bus driver 87 
then drives the result on one of the result busses 32. 
If the address was supplied to data cache 86 over port 
A, then the data is provided to reservation station circuit 
85 via port A. Otherwise, if the address was presented 
to data cache 86 using port B, then the data 
is communicated to reservation station 85 using port 
B. Addresses are communicated to data cache 86 and 
data is received from data cache 86 using ports A and 
B simultaneously. As the load/store result bus driver 
87 drives the result onto one of the result busses 32, 
it also drives the corresponding one of the result tag 
busses 39. 
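
The masking and high-byte remapping applied to a load result can be sketched as follows. The function name format_load_result and the string size labels are hypothetical, and the sketch assumes the requested data already sits in the low-order bits of the doubleword that was read.

```python
def format_load_result(doubleword: int, size: str) -> int:
    """Mask unused bits to zero and place a high byte in the middle field."""
    if size == "doubleword":
        return doubleword & 0xFFFFFFFF
    if size == "word":
        return doubleword & 0x0000FFFF
    if size == "low_byte":
        return doubleword & 0x000000FF
    if size == "high_byte":
        # Remap low field bits <7:0> into middle field bits <15:8>.
        return (doubleword & 0xFF) << 8
    raise ValueError(f"unknown result size: {size}")
```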

A store operation is a doubleword read operation 
from data cache 86, followed by a doubleword write 
back to the cache 86. During a store operation, an 
addressed doubleword is first transferred from data 
cache 86 to store buffer 84. Then the data is communicated 
from reservation station 85 to store buffer 84. 
If the store data is 32 bits or more in width, the data 
replaces the doubleword that was read from data 
cache 86. If the store data is less than 32 bits in width, 
the merge circuit 91 merges the applicable data fields 
into the doubleword that was read from data cache 
86. If a portion of the store data is not available, then 
an operand tag is used to replace the unavailable 
data. The mix of data and tags is held in the store buffer 
until all bit fields of missing data are forwarded 
from the result busses. By holding partial data in the 
store buffer 84 until all fields are available, only full 
doublewords are written to cache 86. Writing of individual 
8-bit bytes is not necessary. The merged data 
is then communicated back to the data cache 86 by 
the load/store result bus driver 87. Load and store 
operations of store data that are greater than 32 bits 
in width execute multiple accesses to the data cache 
86 and construct the data in store buffer 84 before 
writing it back to the data cache 86. When the store 
operation is released, the data and corresponding address 
are communicated using address/data bus 11 
to data cache 86. This embodiment is described in our 
copending European patent application based on US 
application 08/233,563 (Ref. PCCS/TT0221/SMP), which 
is incorporated herein by reference in its entirety. 
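
The read-merge-write behaviour of a store narrower than 32 bits can be sketched as below. The function name merge_store and the bit_offset parameter, which positions the store data within the doubleword, are assumptions; the text does not describe how the position is determined.

```python
def merge_store(cached: int, store_data: int, width_bits: int, bit_offset: int = 0) -> int:
    """Merge store data into the doubleword previously read from the cache.

    Data of 32 bits or more replaces the doubleword; narrower data replaces
    only its own bits, so a full doubleword is always written back.
    """
    if width_bits >= 32:
        return store_data & 0xFFFFFFFF
    mask = ((1 << width_bits) - 1) << bit_offset
    return (cached & ~mask & 0xFFFFFFFF) | ((store_data << bit_offset) & mask)
```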

While the invention has been described with reference 
to particular embodiments, it will be understood 
that the embodiments are illustrative and that 
the invention scope is not so limited. Many variations, 
modifications, additions and improvements to the 
embodiment described are possible. For example, 
the invention may be implemented on a processor 
other than an x86 architecture processor or a CISC 
architecture processor. Also, the bit width of the data 
path may be different from 32 bits or 41 bits. The data 
path may be partitioned into more or fewer fields than 
three. The number of bits in the various structures 
and busses is illustrative, and may be varied. The size 
of the register file and the reorder buffer, the number 
of operand buses and operand tag busses, the number 
of result buses, the number of writeback buses, 
and the type and number of functional units are illustrative, 
and may be varied. The invention may be 
practiced in a processor that is not a superscalar processor 
or a pipelined processor, although the advantages 
of the invention are greater in a superscalar 
implementation. These and other variations, modifications, 
additions and improvements may fall within the 
scope of the invention as defined in the following 
claims. 



Claims 

1. A method of handling operand data in a processor 
which executes operations utilizing operands of a 
variable size, comprising the steps of: 

partitioning an operand utilized by an op- 
eration into a plurality of fields; 

designating each partitioned field as de- 
fined or undefined with respect to the operation; 

detecting data dependencies of each of 
the defined fields, independently of the other par- 
titioned fields; and 

forwarding result data for utilization by the 
operation when the result data becomes avail- 
able for each of the dependent fields indepen- 
dently of the other partitioned fields. 

2. A method as in Claim 1, further comprising the 
steps of: 

executing an operation to produce result 
data; and 

storing the result data in one of a plurality 
of memory elements. 

3. A method as in Claim 2, further comprising the 
step of utilizing data stored in a memory element 
for each non-dependent field independently of 
the other fields. 

4. A method as in Claim 3, wherein operands are 
source operands that are operated upon by an 
operation and destination operands that are gen- 
erated by an operation, further comprising the 
steps of: 

tagging each defined field of a first opera- 
tion's destination operand independently of the 
other fields with a destination tag that identifies 
a memory element which receives the opera- 
tion's result data; and 

tagging each defined field of a second op- 
eration's source operand independently of the 
other fields with a source tag that identifies a 
memory element which supplies the operation's 
operand data, 

wherein the dependency detecting step in- 
cludes the step of: 

comparing the destination tag to the 
source tag, and 

activating the forwarding step when the 
tags mutually correspond for each defined field 
independently of the other fields. 

5. A data handling apparatus in a processor which 
executes operations utilizing operands of a variable 
size, the apparatus comprising: 

means for partitioning an operand utilized 
by an operation into a plurality of fields; 

means responsive to the partitioning 
means for designating each partitioned field as 
defined or undefined with respect to the operation; 

means responsive to the designating 
means for detecting data dependencies of each 
of the defined fields independently of the other 
partitioned fields; 

means responsive to the dependency detecting 
means for forwarding result data for utilization 
by the operation when the result data becomes 
available for each of the dependent fields 
independently of the other partitioned fields; and 

a functional unit responsive to the forwarding 
of result data to execute the operation 
and generate a result. 

6. An apparatus as in Claim 5, further comprising: 

a memory coupled to the functional unit in- 
cluding memory elements to store result data. 

7. An apparatus as in Claim 6, further comprising: 

means for utilizing data stored in the memory 
elements for each non-dependent field independently 
of the other fields. 

8. An apparatus as in Claim 7, wherein operands in- 
clude source operands that are operated upon by 
an operatbn and destination operands that are 
generated by an operation, the apparatus further 
comprising: 35 

means for tagging each defined field of a 
first operation's destination operand indepen- 
dently of the other fields with a destination tag 
which identifies a menrK>ry element that receives 
the operation's result data; and 40 

means for tagging each defined field of a 
second operation's source operand independent- 
ly of the other fields with a source tag which iden- 
tifies a memory element that supplies the opera- 
tion's operand data, 45 

wherein the dependency detecting means 
includes: 

a comparator which compares the destin-
ation tag to the source tag and activates the for-
warding means when the tags mutually corre-
spond for each defined field independently of the
other fields.
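
The per-field comparison recited in Claim 8 can be pictured with a minimal sketch. It is an illustration only, not the claimed circuit: the function name, the field labels and the use of None to mark an undefined field are all assumptions introduced here.

```python
def fields_to_forward(dest_tags, source_tags):
    """Return the fields whose source tag matches the destination tag.

    Both arguments map a field label to a tag, or to None when the field
    is undefined for that operation (a hypothetical convention).
    """
    return [name for name, src in source_tags.items()
            if src is not None and dest_tags.get(name) == src]


# A first operation writes the middle and low fields under tag 7;
# a second operation reads only the low field and waits on tag 7.
dest = {"high": None, "middle": 7, "low": 7}
src = {"high": None, "middle": None, "low": 7}
matching = fields_to_forward(dest, src)
```

Each field is compared on its own, so a match in one field never depends on the tags carried by the others.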

9. A processor as in Claim 5, wherein the operand
has a bit width of 32 bits and is partitioned by the
partitioning means into three fields including a
high order 16 bit field, a middle order 8 bit field
and a low order 8 bit field.
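
As a minimal sketch of the 16/8/8 partition recited in Claim 9, the following splits a 32-bit value into the three fields and reassembles it. The bit positions assume the high order field occupies bits 31 to 16, the middle order field bits 15 to 8 and the low order field bits 7 to 0; the names FIELDS, partition and combine are hypothetical.

```python
# Assumed layout: (label, lowest bit position, width in bits).
FIELDS = (("high", 16, 16), ("middle", 8, 8), ("low", 0, 8))


def partition(operand):
    """Split a 32-bit operand into its high, middle and low fields."""
    operand &= 0xFFFFFFFF
    return {name: (operand >> shift) & ((1 << width) - 1)
            for name, shift, width in FIELDS}


def combine(fields):
    """Reassemble a 32-bit operand from its three fields."""
    value = 0
    for name, shift, width in FIELDS:
        value |= (fields[name] & ((1 << width) - 1)) << shift
    return value


parts = partition(0x12345678)
```

Here parts holds 0x1234, 0x56 and 0x78 for the high, middle and low fields, and combine(parts) restores the original value.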



10. A processor which executes operations utilizing 
operands of a variable size, comprising: 

an instruction decoder including 

means for partitioning an operand 
into a plurality of contiguous fields within the va- 
riable size operand of an operation, and 

means for designating each field of 
the operand as defined or undefined with respect 
to the operation; 

a reorder buffer coupled to the instruction 
decoder including 

a memory storing operand data and 
means for detecting data depend- 
encies for each operand field independently of 
the other fields; 

a bus coupled to the reorder buffer to com- 
municate operand data for each defined field in- 
dependently of the other fields; and 

a functional unit coupled to the bus to exe- 
cute operations and generate execution result 
data. 

11. A processor as in Claim 10, wherein the bus in-
cludes:

an operand bus coupled from the reorder
buffer output to the functional unit input to com-
municate operand data; and

a result bus coupled from the functional 
unit output to the inputs of the reorder buffer and 
the functional unit to communicate result data. 

12. A processor as in Claim 11, wherein the reorder 
buffer further comprises: 

means for assigning a destination tag 
identifier which designates a memory element to 
receive operation execution result data for each 
operand field independently of the other fields; 

means for assigning an operand tag iden- 
tifier which designates a source operand of a 
data dependent operation for each operand field 
independently of the other fields, the processor 
further comprising: 

a destination tag bus coupled from the re- 
order buffer output to the functional unit input to 
convey the destination tag identifier of an opera- 
tion antecedent to its execution by a functional 
unit; 

a result tag bus coupled from the function- 
al unit output to the inputs of the functional unit 
and the reorder buffer to convey the destination 
tag identifier of an operation subsequent to its 
execution by a functional unit; 

an operand tag bus coupled from the reor- 
der buffer output to the functional unit input to 
convey the operand tag identifier of a data de- 
pendent operation; and 

a reservation station associated with the 
functional unit and coupled to the operand tag 
bus, the result tag bus and the result bus, the res-
ervation station including:

a comparator which compares an
identifier from the operand tag bus to an identi-
fier from the result tag bus for each defined field
independently of the other fields; and

a forwarding circuit coupled to the
result bus and the comparator and responsive to
mutually corresponding operand tag and result
tag identifiers to forward result data for each de-
fined field independently of the other fields.

13. A processor as in Claim 12, further comprising: 

a register file coupled to the reorder buffer 
and the operand bus and having a memory,
wherein: 

the reorder buffer memory stores specula- 
tive operation results, and 

the register file stores operation results
from the reorder buffer memory when the opera-
tions become nonspeculative.

14. A processor as in Claim 13, further comprising: 

a reorder buffer bus driver including 
means for driving data from the reorder buffer
memory onto an operand bus for each field hav- 
ing speculative data available independently of 
the other fields; and 

a register file bus driver including means 
for driving data from the register file memory onto
an operand bus for the remaining fields other 
than those driven by the reorder bus driver. 
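
The division of labor between the two bus drivers in Claim 14 can be sketched as a field-by-field merge. This is an illustration under assumptions: the 16/8/8 masks are borrowed from the 32-bit claims, and the names FIELD_MASKS and drive_operand_bus are hypothetical.

```python
# Assumed 32-bit layout: 16-bit high, 8-bit middle, 8-bit low field.
FIELD_MASKS = {"high": 0xFFFF0000, "middle": 0x0000FF00, "low": 0x000000FF}


def drive_operand_bus(rob_value, rob_fields, regfile_value):
    """Merge a 32-bit operand field by field.

    rob_fields names the fields for which the reorder buffer holds
    speculative data; every remaining field comes from the register file.
    """
    bus = 0
    for name, mask in FIELD_MASKS.items():
        source = rob_value if name in rob_fields else regfile_value
        bus |= source & mask
    return bus


# Only the low field has a speculative value (0xAB) in the reorder buffer.
operand = drive_operand_bus(0x000000AB, {"low"}, 0x12345678)
```

In this example the high and middle fields are taken from the register file value and the low field from the reorder buffer, giving 0x123456AB.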

15. A processor as in Claim 14, further comprising: 

a result bus driver coupled to the function-
al unit and responsive to the execution of an op-
eration to drive result data from the functional
unit onto a result bus. 

16. A processor as in Claim 15, wherein the function-
al unit further comprises means for setting a re-
sult field to 0 for each undefined field.
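
Setting undefined result fields to 0, as in Claim 16, amounts to masking the result with the union of the defined fields. The sketch below assumes the 16/8/8 layout of the 32-bit claims; FIELD_MASKS and zero_undefined_fields are hypothetical names.

```python
# Assumed 32-bit layout: 16-bit high, 8-bit middle, 8-bit low field.
FIELD_MASKS = {"high": 0xFFFF0000, "middle": 0x0000FF00, "low": 0x000000FF}


def zero_undefined_fields(result, defined_fields):
    """Keep the bits of defined fields and clear every other field to 0."""
    keep = 0
    for name in defined_fields:
        keep |= FIELD_MASKS[name]
    return result & keep


# An operation that defines only the middle and low fields.
masked = zero_undefined_fields(0x12345678, {"middle", "low"})
```

With only the middle and low fields defined, the high 16 bits are cleared and masked equals 0x5678.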

17. A processor as in Claim 10, wherein the operand
has a bit width of 32 bits and is partitioned by the
partitioning means into three fields including a
high order 16 bit field, a middle order 8 bit field
and a low order 8 bit field.

18. In a processor which executes multiple concur-
rent operations, a method of executing opera-
tions and handling operation-associated variable
size source and destination operands, compris-
ing the steps of:

partitioning a full-sized operand field bit-
wise into a plurality of independent operand
fields; 

designating the operand fields as defined
or undefined with respect to the operation;

accessing source operand data from a 
memory for each defined field independently of 
other fields; 

executing the operation to generate a re- 
sult; and 

furnishing to the memory the result as the 
destination operand data for each defined destin- 
ation field independently of other fields. 

19. A method as in Claim 18, further comprising the 
steps of: 

detecting data dependencies for each de- 
fined source operand field independently of other 
fields; 

forwarding a result for each dependent
source operand field independently of other
fields when the result is available; and 

executing the operation when results are 
forwarded for all dependent source operand 
fields. 

20. A processor which executes multiple operations 
concurrently utilizing variable size source and 
destination operands, the processor comprising: 

an instruction decoder means for parti- 
tioning a full-sized operand bit-wise into a plural- 
ity of independent contiguous fields; 

means coupled to the partitioning means 
for designating the fields of the source and des- 
tination operands as defined or undefined with 
respect to an operation; 

an operand data memory; 

means coupled to the memory and cou- 
pled to the designating means for accessing 
source operand data from the memory for each 
defined source operand field independently of 
other fields; 

means coupled to the accessing means for 
executing the operation utilizing the accessed 
source operand data to generate a result; and 

means coupled from the executing means 
to the memory for furnishing the result to the des- 
tination operand memory for each defined des- 
tination field independently of other fields. 

21. A processor as in Claim 20, further comprising: 

means for detecting data dependencies for
each defined source operand field independently
of other fields; 

means for forwarding a result for each de- 
pendent source operand field independently of 
other fields when the result is available; and 

means for executing the operation when 
results are forwarded for all dependent source 
operand fields. 

22. In a processor, a method for executing operations 
utilizing variable-sized operands comprising the
steps of:

partitioning full-sized operands bit-wise 
into several operand fields; 

for each operation, determining for each 
operand field independent of the other fields 
whether data of each operand field that is utilized 
by the operation is dependent on an unavailable 
result of a nonexecuted operation; and 

executing the operation utilizing the oper- 
and field data if data in all utilized fields are not 
dependent on an unavailable result, and other- 
wise waiting for dependent data in the fields to 
become available and then executing the opera- 
tion. 

23. A method as in Claim 22, wherein the dependence
determining step further comprises the steps of: 

defining for an operation a destination op- 
erand and its fields and a source operand and its 
fields; 

storing identifiers of the destination oper- 
and and its fields for several operations; 

comparing the source operand and field 
identifiers to the stored destination operands and 
fields; and 

determining a data dependency when 
source operand and field identifiers match stored 
destination operand and field identifiers. 
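
The comparison steps of Claim 23 can be sketched as a scan over stored destination identifiers. This is an assumption-laden illustration: the list layout, the register labels and the rule that the most recently stored writer of a field wins are not taken from the claim, and find_dependencies is a hypothetical name.

```python
def find_dependencies(pending, source_reg, source_fields):
    """Map each dependent source field to the tag of its pending writer.

    pending is a list of (tag, destination register, destination fields)
    kept in program order for operations whose results are not yet
    available. A later entry overrides an earlier one for the same field
    (an assumed policy).
    """
    deps = {}
    for tag, dest_reg, dest_fields in pending:
        if dest_reg != source_reg:
            continue
        for field in set(source_fields) & set(dest_fields):
            deps[field] = tag
    return deps


pending = [
    (0, "R1", {"high", "middle", "low"}),  # full-width write to R1
    (1, "R1", {"low"}),                    # later write to the low field only
    (2, "R2", {"low"}),                    # unrelated register
]
deps = find_dependencies(pending, "R1", {"middle", "low"})
```

In this example the middle field depends on the operation tagged 0 and the low field on the operation tagged 1, so the two fields of one source operand wait on different results.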

24. A processor which executes operations operating 
on variable-sized operands, comprising:

an instruction decoder including: 

an operand field selector which de- 
fines a variable-sized operand by partitioning a 
full-sized operand bit-wise into several operand 
fields and identifying operand fields utilized by 
the operation, and 

a dispatcher coupled to the oper- 
and field selector to dispatch operation codes, 
operand identifiers and utilized operand field 
identifiers; 

a reorder buffer coupled to the instruction 
decoder and including: 

a speculative result memory cou- 
pled to the dispatcher to receive operand identi- 
fiers and utilized field identifiers; 

a reorder buffer controller respon- 
sive to the dispatching means to allocate entries 
in the speculative result memory; 

a data dependency detector cou- 
pled to the reorder buffer controller which detects 
data dependencies of a utilized operand field in- 
dependently of other fields, and 

a tagging circuit coupled to the de-
pendency detector and responsive to a data de-
pendency to tag a data dependent operand field;

an operand bus coupled to the reorder buf-
fer for communicating operand fields and oper-
and field tags;

a functional unit coupled to the operand 
bus to receive the operand fields and the operand 
field tags and execute operations defined by the
dispatched operation code utilizing the operand 
fields to generate result data; and 

a result bus connected from the functional 
unit output to the inputs of the reorder buffer and 
the functional unit to forward operation result
data thereto. 

25. A processor as in Claim 24, further comprising:

a register file including a nonspeculative
result memory;

a destination tag bus coupling the reorder
buffer to the functional units to communicate a
destination tag to the functional unit, the destin-
ation tag specifying, for each utilized operand
field, a destination register in the register file to
receive a result when the result becomes nonspe-
culative.

26. A processor as in Claim 25, further comprising:
an operand tag bus coupling the reorder
buffer to the functional unit to communicate the
operand field tag to the functional unit; and

a result tag bus connected from the func-
tional unit output to the inputs of the reorder buf-
fer and the functional unit to communicate a re-
sult tag, the result tag being the destination tag of
an operation when the operation is executed and
the result is available; wherein

the functional unit includes a forwarding
circuit to compare the result tag to the operand
tag of a data dependent operation for each util-
ized operand field and to forward the result data
fields when the tags mutually correspond.

27. A processor in accordance with Claim 26, wherein
the functional units execute operations in which
operand fields are alternatively provided by:
(1) the register file via the operand bus when
the operand data is nonspeculative,
(2) the reorder buffer via the operand bus
when the operand data is speculative, or
(3) the functional unit via a result bus upon
generation of a previously unavailable result.

28. A processor in accordance with Claim 24, further
including a load / store functional unit compris-
ing:

a data cache memory for storing several
fixed bit width data operands;

means, connected to the data cache mem-
ory to retrieve fixed bit width data operands, for
clearing bits not activated by the field selector, re-
mapping fields of the fixed bit width operands,
and combining data from different fields into the
fixed bit width operand data; and

an interface driver for communicating op-
erand data to the result bus.

29. A processor in accordance with claim 24, where- 
in: 

the full-sized operand is 32 bits wide; 

the instruction decoder, the reorder buffer,
the register file, the functional unit, the operand
bus and the result bus operate upon data that is
32 bits wide; and

the 32-bit full-sized operands are parti-
tioned into three operand fields, a 16-bit high or-
der field, an 8-bit middle order field and an 8-bit
low order field.

30. A processor which executes multiple operations 
concurrently utilizing variable size source and 
destination operands, the processor comprising:

means for partitioning a full-sized operand 
bit-wise into a plurality of independent contiguous 
fields; 

means coupled to the partitioning means 
for designating the fields of the source and des-
tination operands as defined or undefined with
respect to an operation; 

an operand data memory; 

means coupled to the memory and cou-
pled to the designating means for accessing
source operand data from the memory for each 
defined source operand field independently of 
other fields; 

means coupled to the accessing means for 
executing the operation utilizing the accessed
source operand data to generate a result; and 

means coupled from the executing means 
to the memory for furnishing the result to the des-
tination operand memory for each defined des-
tination field independently of other fields.

31. A processor as in Claim 30, further comprising: 

means for detecting data dependencies 
for each defined source operand field indepen-
dently of other fields;

means for forwarding a result for each de- 
pendent source operand field independently of 
other fields when the result is available; and

means for executing the operation when 
results are forwarded for all dependent source
operand fields. 